Fix regression bug in tutorial #2399
Conversation
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
>>>          for document in documents]
>>> texts = [
>>>     [word for word in document.lower().split() if word not in stoplist]
Do we want to demonstrate the best practice for string tokenization here?
The best practice is to use a dedicated library (e.g. NLTK)… do we want to complicate the tutorial?
Maybe we should.
But if we don't, this should be something really stupid (like it is now), so people don't accidentally copy-paste thinking it's a good idea.
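For concreteness, an untested sketch of what the NLTK route might look like (not part of this PR; assumes nltk is installed and its punkt tokenizer data has been downloaded):

# Hedged sketch: swap str.split() for NLTK's tokenizer.
# Requires `pip install nltk` and `nltk.download('punkt')` beforehand.
from nltk.tokenize import word_tokenize

texts = [[word for word in word_tokenize(document.lower()) if word not in stoplist]
         for document in documents]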
I was thinking more along the lines of https://radimrehurek.com/gensim/utils.html#gensim.utils.tokenize
WDYT?
OK, that's an option (but utils.simple_preprocess is better).
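Something along these lines, perhaps (untested sketch; simple_preprocess lowercases and tokenizes in one step, and by default drops tokens shorter than 2 characters):

# Hedged sketch: use gensim's own preprocessor instead of bare str.split().
from gensim.utils import simple_preprocess

texts = [[word for word in simple_preprocess(document) if word not in stoplist]
         for document in documents]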
>>> texts = [
>>>     [word for word in document.lower().split() if word not in stoplist]
>>>     for document in documents
>>> ]
>>>
>>> # remove words that appear only once
>>> frequency = defaultdict(int)
I know this isn't part of the PR, but it may be better to simplify this section as:
# untested code but "should" work
import collections
frequency = collections.Counter()
for text in texts:
    frequency.update(text)
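A Counter also plugs straight into the tutorial's next step, the filter that drops words appearing only once; an untested sketch:

# Untested sketch: Counter returns 0 for unseen tokens, so the lookup works as-is.
texts = [[token for token in text if frequency[token] > 1] for text in texts]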
Fine with me.
@@ -207,8 +214,11 @@ Similarly, to construct the dictionary without loading all texts into memory:
>>> # collect statistics about all tokens
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
Can we use a real corpus here? Also, we may want to show the best practice for string tokenization here.
That way, we can doctest this file.
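For example, an untested sketch that keeps the streaming behavior ('mycorpus.txt' is the tutorial's placeholder file; simple_preprocess stands in for the bare lower().split()):

# Hedged sketch: build the dictionary one line at a time, never holding
# all documents in memory.
from gensim import corpora, utils

dictionary = corpora.Dictionary(
    utils.simple_preprocess(line) for line in open('mycorpus.txt')
)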
Looks good, left some minor comments about potential improvements. If they're out of scope for the current PR, it may be good to capture them in a separate issue so that they get done eventually. Let me know what you think.
There's a ton of improvements we could make to the tutorials, but they're out of scope for this PR. I opened #2424 instead.
There is a minor bug in the Corpora and Vector Spaces tutorial: the import of corpora got removed in #2192 (I assume by accident). Identified by a confused user on our mailing list.
This PR reconstitutes the import, plus fixes the code style while I was at it.
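For reference, the restored line is gensim's standard corpora import, as used by the tutorial's snippets (e.g. corpora.Dictionary above):

# The import that #2192 accidentally dropped; later snippets reference
# corpora.Dictionary, so without it they fail with a NameError.
from gensim import corpora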